R is a script-based language. You write down a list of instructions and it will follow, performing one action after another. This is different to ‘point and click’ software like Microsoft Excel, and it can feel a bit cumbersome.
In Excel, you can perform a series of steps:
But there is no record of any of this. You might look back at the data in a week’s time and not know which rows you have deleted. This is particularly important when you go to write up your methodology for an assignment or thesis. You might not be able to recreate your own work.
R is a script-based program. You write a list of instructions and R will follow it. This is wonderfully handy:
There are many proprietary script-based programs for analysing data: Stata, SAS, Eviews, SPSS, Matlab.
They cost $$$. If you are not at a university/workplace that has the program, you can’t use it. Or you will have to pay for a license yourself.
R is free, open-source and powerful. In the past five years, it has also become easier to use.
Proprietary programs are also centrally controlled.
The functions are written by the company, and you can only use the set of functions they provide.
R thrives on user-written packages (collections of functions) that are available to everyone, for free. From a recent study:
In 2015, R added 1,357 packages, counting only CRAN, or approximately 27,642 functions. During 2015 alone, R added more functions than SAS Institute has written in its entire history. [1]
You will need to download and install R and R Studio.
R is the language and the program. Think of it as the engine that powers the things you do. You can download it for:
Once downloaded, follow the prompts to install. Restart your computer if required.
R Studio is the interface you will use R with. The technical term is an ‘integrated development environment’ (IDE) for R. Think of it as the dashboard that shows you all the things you’ve got going on in R. You can download it for:
Then, follow the prompts to install and restart if required.
Good folder structure is tedious and abstract and not-at-all-fun but it makes everything in the future easier. It simply means you have:
* One folder for each project. Here, that folder is introduction_to_R. But your projects will likely be econometrics_assignment2 or honours_thesis. Whatever the project, everything you need for it is contained within the folder.
* Keep data in a folder called data. Keep output (tables, charts, etc.) in an output folder. Note that you can set these up how you like, but consistency makes it easier for you to switch between projects.

Your script will often ask for things on the computer. For example, “read in this dataset” or “save this chart to a place”. For that, we have to tell the computer where it is.
A bad way to tell your computer where it is:
This is sometimes done by ‘setting a working directory’. This means having a line in your script that says ‘this is where we are’:
This is problematic and frustrating, especially if you are collaborating. Your directory path won’t be the same as your collaborators’ or tutors’ (unless you have the same name and the same operating system!).
From Hadley Wickham’s excellent R for Data Science:
But you should never do this because there’s a better way; a way that also puts you on the path to managing your R work like an expert.
A better way to tell your computer where it is:
The best way to tell your computer where it is is to use R Projects. An R Project is a little file that lives in your project folder with the suffix .Rproj. Opening this file opens R Studio and sets your working directory to the folder the file lives in.
This is beneficial because it means you don’t have to write setwd("Users/yourname/Documents/myRfolder/this_project_of_mine") on every single script you write. It also means that your collaborators can open your project folder on their computer and all scripts will run without a hitch.
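As a rough sketch of why this works, the example below builds a throwaway project folder (the name my_project and the file example.csv are made up for illustration) with data and output subfolders, then reads from it using a relative path. It calls setwd only to imitate what opening an .Rproj file does for you, so that the example is self-contained.

```r
# A throwaway project folder inside R's temporary directory, standing in
# for a real project such as honours_thesis (my_project is a made-up name)
project <- file.path(tempdir(), "my_project")
dir.create(file.path(project, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "output"), showWarnings = FALSE)

# Opening an .Rproj file sets the working directory for you; setwd() is
# used here only to imitate that step
old_wd <- setwd(project)

# Save a tiny made-up dataset into the data folder using a relative path
write.csv(data.frame(x = 1:3), file.path("data", "example.csv"),
          row.names = FALSE)

# The same relative path finds the file, whoever's computer the folder is on
found   <- file.exists("data/example.csv")
folders <- list.files()
found
folders

# Go back to where we started
setwd(old_wd)
```

Nothing in the relative path "data/example.csv" mentions a user name or operating system, which is what lets a collaborator run the same script unchanged.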
R Studio is an integrated development environment (IDE) and is how we will interact with R. It looks like this:
The four panes are labelled in green:
I know this can all look a bit intimidating the first time you see it. That’s okay! We’ll get to know R Studio more.
A function takes inputs (arguments) and produces outputs.
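As a small illustration of inputs and outputs, here is a sketch using two functions that come with R, sqrt and round. Neither is used elsewhere in this document; they are chosen only because their behaviour is easy to check by hand.

```r
# sqrt() takes one input (a number) and outputs its square root
root <- sqrt(16)
root
## [1] 4

# round() takes two inputs: the number to round, and how many digits to keep
rounded <- round(3.14159, digits = 2)
rounded
## [1] 3.14
```

Naming an argument, as with digits = 2, makes it clear which input is which.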
We can use the c function to combine (concatenate) numbers into a series of numbers (a vector):
c(3, 4, 4)
## [1] 3 4 4
The output above, like all output in this document, is preceded by ##. The [1] tells you that the line begins with the first element of the output. Here, the output is the vector of numbers we entered into the c function.
We can also nest functions, meaning we have one function inside another function. For example, we can combine numbers into a vector using the c function, then we can take the average (mean) of the vector:
# Use the c function to combine numbers (input) into a vector (output)
# Then take the mean of that vector:
mean(c(3, 4, 5))
## [1] 4
But nested functions are a bit difficult to read. You have to start from the inside and read outwards. Alternatively, we could assign our vector to an object using the assignment operator, <-:
# Use the c function to combine numbers (input) into a vector (output)
# And assign that to the object 'goodnumbers'
goodnumbers <- c(3, 4, 5)
# Then take the mean of goodnumbers
mean(goodnumbers)
## [1] 4
This will make changes in our Environment: it adds the object goodnumbers. It will also produce output in the Console: the mean of goodnumbers. It should look something like this:
Installing a package is like installing an app on your phone or computer: you need to do it, and you only need to do it once.
You can install a package using the install.packages function. Note that lots of text will appear when installing a package.
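Because a package only needs installing once, one possible pattern is to check whether it is already there before installing. The sketch below wraps that check in a helper called install_if_missing; that name is made up for this example and is not part of R or any package.

```r
# Hypothetical helper: install a package only when it is not already installed
install_if_missing <- function(pkg) {
  # requireNamespace() returns TRUE if the package can be loaded
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # already installed, nothing to do
  }
  install.packages(pkg)       # otherwise download and install it
  invisible(TRUE)
}

# 'stats' ships with R, so this call does nothing and returns FALSE
already_there <- install_if_missing("stats")
```

Calling install_if_missing("tidyverse") would install the tidyverse the first time and do nothing on later runs.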
Now we need to load it using the library function; like opening an app you have installed on your phone. We do this every time (every ‘session’) we want to use it.
Notice below that we received some messages and warnings when we loaded the tidyverse:
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────── tidyverse 1.2.1.9000 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.3.0
## ✔ tibble  2.0.1     ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.2     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## Warning: package 'tibble' was built under R version 3.5.2
## Warning: package 'purrr' was built under R version 3.5.2
## Warning: package 'dplyr' was built under R version 3.5.2
## Warning: package 'stringr' was built under R version 3.5.2
## Warning: package 'forcats' was built under R version 3.5.2
## ── Conflicts ────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
We read the data in with the read_csv function and, here, we’re only going to give it one argument: the path, in quotation marks, to the csv file you want to read.
Tip: open quotation marks and hit tab to choose your file (and save you some typing).
read_csv("data/gapminder.csv")
## Parsed with column specification:
## cols(
## country = col_character(),
## continent = col_character(),
## year = col_double(),
## lifeExp = col_double(),
## pop = col_double(),
## gdpPercap = col_double()
## )
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,694 more rows
Looks good! But it isn’t in our Environment (on the right) yet because we didn’t assign it to anything. We assign something using <-.
gapminder <- read_csv("data/gapminder.csv")
## Parsed with column specification:
## cols(
## country = col_character(),
## continent = col_character(),
## year = col_double(),
## lifeExp = col_double(),
## pop = col_double(),
## gdpPercap = col_double()
## )
Now it is in our Global Environment over there --> wooh!
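To see the difference between printing and assigning without needing the gapminder file, here is a sketch that writes a tiny made-up csv to a temporary file and reads it both ways. The file contents and the object name tiny are invented for this example; read_csv comes from the readr package, which is part of the tidyverse.

```r
library(readr)

# Write a tiny made-up dataset to a temporary csv file
# (a stand-in for data/gapminder.csv)
tiny_path <- tempfile(fileext = ".csv")
writeLines(c("country,year,pop",
             "A,2002,10",
             "A,2007,12"),
           tiny_path)

# Without <- the data is printed to the Console and then forgotten
read_csv(tiny_path)

# With <- the data is stored as an object called tiny
tiny <- read_csv(tiny_path)

# exists() checks whether an object with that name has been created
tiny_exists <- exists("tiny")
tiny_exists
```

Only the second call leaves something behind in the Environment that later lines of the script can use.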
Much like Excel, we can explore the gapminder dataset with our eyes.
View(gapminder) will open up a new tab that displays your dataset. You can scroll through it.
head will print just the first few observations. This is handy to check on things as you’re going along.
head(gapminder)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
names will display the names of all variables in the dataset (and is often the answer to ‘what was that variable called again…’).
names(gapminder)
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
Now close your eyes and picture the gapminder dataset:
* Add a new column to the right with the name ‘my_column’.
* Only keep rows from 2007.
* Then remove the ‘year’ column.
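The three steps above can be sketched on a tiny made-up dataset using the dplyr package (part of the tidyverse). The data, the object names toy and toy07, and the contents of my_column are all invented for illustration. The %>% symbol is the pipe, which can be read as "then".

```r
library(dplyr)

# A tiny made-up dataset with the same kind of columns as gapminder
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2002, 2007, 2002, 2007),
  pop     = c(10, 12, 20, 25)
)

toy07 <- toy %>%                    # take the toy dataset, then
  mutate(my_column = pop * 2) %>%   # add a new column on the right (arbitrary contents), then
  filter(year == 2007) %>%          # only keep rows from 2007, then
  select(-year)                     # remove the year column

toy07
```

Each line does one thing to the result of the line before it, so the script reads in the same order as the list of steps.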
gapminder07 <- gapminder %>%          # Assign gapminder07 to: the gapminder dataset, then
  mutate(gdp = gdpPercap * pop) %>%   # create a new column called gdp, then
  filter(year == 2007) %>%            # keep only observations from 2007, then
  select(-gdpPercap)                  # drop the gdpPercap variable (negative select)

We want to make our plots as clear as possible…
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
# with a log scale
gapminder07 %>%
  ggplot(aes(x = lifeExp,
             y = gdp)) +
  geom_point() +
  scale_y_log10(labels = comma)

# with colour
gapminder07 %>%
  ggplot(aes(x = lifeExp,
             y = gdp,
             colour = continent)) +
  geom_point() +
  geom_line(aes(group = country)) +
  scale_y_log10(labels = comma)
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
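The geom_path message above appears because gapminder07 only holds the 2007 rows, so each country has a single observation and geom_line has nothing to join it to. A sketch with made-up data shows the same situation; subset and table are base R functions, used here as stand-ins so the example runs without any packages.

```r
# Made-up data: two countries, two years each
toy_years <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2002, 2007, 2002, 2007)
)

# Keep only 2007, as was done to create gapminder07
toy_2007 <- subset(toy_years, year == 2007)

# Count the rows for each country: one each, so a line drawn per
# country would have a single point and nothing to connect it to
rows_per_country <- table(toy_2007$country)
rows_per_country
```

With more than one year per country in the data, the lines would have points to connect and the message would not appear.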
# with colour and facet
gapminder07 %>%
  ggplot(aes(x = lifeExp,
             y = gdp,
             colour = continent)) +
  geom_point() +
  geom_line(aes(group = country)) +
  scale_y_log10(labels = comma) +
  facet_wrap(~ continent)
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
# with colour, size and a facet grid by decade
gapminder07 %>%
  mutate(decade = signif(year, 3)) %>%
  ggplot(aes(x = lifeExp,
             y = gdp,
             colour = continent,
             size = pop)) +
  geom_point() +
  geom_line(aes(group = country)) +
  scale_y_log10(labels = comma) +
  facet_grid(decade ~ continent)
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
# with colour, transparency and facet
gapminder07 %>%
  ggplot(aes(x = lifeExp,
             y = gdp,
             colour = continent)) +
  geom_point(alpha = 0.5) +
  geom_line(aes(group = country)) +
  scale_y_log10() +
  facet_wrap(~ continent)
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?
At some point throughout your university life you will need to write equations in a document.
$A = (r^{4}) / $
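As a hedged illustration of the syntax, and assuming the document is written in R Markdown where LaTeX sits between dollar signs, the sketch below writes the sample mean both inline and as a displayed equation. The formula is chosen only because it matches the earlier mean(c(3, 4, 5)) example; it is not the equation from the line above.

```latex
% Inline: wrap the equation in single dollar signs
The sample mean is $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.

% Displayed on its own line: wrap the equation in double dollar signs
$$
\bar{x} = \frac{3 + 4 + 5}{3} = 4
$$
```

Here \frac{a}{b} draws a fraction, \sum draws the summation sign, and _ and ^ attach subscripts and superscripts.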
"Read a .csv file from the path "data/gapminder.csv"
[1] r4stats.com/articles/popularity/ (the potential bias is indicated in its domain)